On the Forecast Combination Puzzle


Wei Qian, Craig A. Rolling*, Gang Cheng, Yuhong Yang


Abstract. It is often reported in the forecast combination literature that a simple average of candidate forecasts is more robust than sophisticated combining methods. This phenomenon is usually referred to as the “forecast combination puzzle”. Motivated by this puzzle, we explore its possible explanations, including estimation error, invalid weighting formulas and model screening. We show that the existing understanding of the puzzle should be complemented by the distinction between different forecast combination scenarios known as combining for adaptation and combining for improvement. Applying combining methods without consideration of the underlying scenario can itself cause the puzzle. Based on this new understanding, both simulations and real data evaluations are conducted to illustrate the causes of the puzzle. We further propose a multi-level AFTER strategy that can integrate the strengths of different combining methods and adapt intelligently to the underlying scenario. In particular, by treating the simple average as a candidate forecast, the proposed strategy is shown to avoid the heavy cost of estimation error and, to a large extent, solve the forecast combination puzzle.

Key Words: combining for adaptation, combining for improvement, multi-level AFTER, model selection, structural break


1. Introduction


Since the seminal work of Bates and Granger (1969), both empirical and theoretical investigations support that when multiple candidate forecasts for a target variable are available to an analyst, forecast combination often provides more accurate and robust forecasting performance in terms of mean square forecast error (MSFE) than using a single candidate forecast. The benefits of forecast combination are attributable to the facts that individual forecasts often use different sets of information, are subject to model bias from different but unknown model misspecifications, and/or are affected differently by structural breaks. The review of Timmermann (2006) provides a comprehensive account of various forecast combination methods. In particular, one popular method is to combine forecasts by estimating a theoretically optimal weight through the minimization of mean square error (MSE). For example, Bates and Granger (1969) propose to find the optimal weight using the error variance-covariance structure of the individual forecasts. Granger and Ramanathan (1984) construct the optimal weight under a linear regression framework.

* Co-first author
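The two classical weighting formulas mentioned above can be sketched as follows. This is a minimal illustration, not the original authors' code: the Bates and Granger (1969) weight is computed as $\mathbf{w} = \hat\Sigma^{-1}\mathbf{1}/(\mathbf{1}^\top\hat\Sigma^{-1}\mathbf{1})$ from past forecast errors (assumed here to have mean zero), and the Granger and Ramanathan (1984) weight by least squares with an intercept, which is only one of several variants. The function names are hypothetical.

```python
import numpy as np


def bates_granger_weights(errors):
    """Variance-covariance weights w = S^{-1} 1 / (1' S^{-1} 1).

    errors: (T, K) array of past forecast errors y_t - yhat_{t,i}, treated as
    mean zero (an assumption of this sketch). The weights sum to one.
    """
    e = np.asarray(errors, dtype=float)
    S = e.T @ e / e.shape[0]                # sample error second-moment matrix
    ones = np.ones(S.shape[0])
    s_inv_ones = np.linalg.solve(S, ones)   # S^{-1} 1 without forming the inverse
    return s_inv_ones / (ones @ s_inv_ones)


def granger_ramanathan_weights(y, forecasts):
    """Unconstrained OLS of y on an intercept and the K candidate forecasts."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(forecasts, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1:]                # (intercept, weights)


# Uncorrelated errors with variances 1 and 4 give weights 0.8 and 0.2.
E = np.array([[1.0, 2.0], [-1.0, 2.0], [1.0, -2.0], [-1.0, -2.0]])
print(bates_granger_weights(E))
```

Solving the linear system instead of inverting $\hat\Sigma$ is the usual numerically safer choice; with few observations or nearly collinear errors, $\hat\Sigma$ becomes ill-conditioned, which is one concrete form of the estimation error discussed next.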

Despite the ever-increasing popularity and sophistication of combining methods, it is repeatedly reported in the literature that the simple average (SA) is a very effective and robust forecast combination method that often outperforms more complicated combining methods (see Winkler and Makridakis (1983), Clemen and Winkler (1986) and Diebold and Pauly (1990) for some early examples). In a review and annotated bibliography of earlier studies, Clemen (1989) raises the question, “What is the explanation for the robustness of the simple average of forecasts?”. Specifically, he poses two questions of interest: “(1) Why does the simple average work so well, and (2) under what conditions do other specific methods work better?” The robustness of SA is also echoed in more recent literature. For example, Stock and Watson (2004) build autoregressive models with univariate predictors (macroeconomic variables) as candidate forecasts for output growth of seven developed countries, and find that SA, together with other methods of least data adaptivity, is among the top-performing forecast combination methods. Stock and Watson (2004) further coin the term “Forecast Combination Puzzle” (for brevity, we refer to the puzzle as FCP hereafter), which refers to “the repeated finding that simple combination forecasts outperform sophisticated adaptive combination methods in empirical applications”. In another recent example, Genre et al. (2013) use survey data from professional forecasters as the individual candidates to construct combined forecasts for three target variables. Despite some promising results of complicated methods, they note that the observed improvement over SA becomes unclear when a period of financial crisis is included in the analysis. The past empirical evidence appears to support the mysterious existence of FCP, which is also summarized in Timmermann (2006, section 7.1).

Many attempts have been made to demystify FCP. One popular and arguably the most well-studied explanation for FCP is the estimation error of combining methods that rely on estimating the optimal weight by MSE minimization. Smith and Wallis (2009) rigorously study the estimation error issue. Using the forecast error variance-covariance structure, they show both theoretically and numerically that the estimator targeting the optimal weight can have large variance; consequently, the estimated optimal weight can be very different from the true optimal weight, often even farther from it than the simple equal weight is. Elliott (2011) studies the theoretical maximal performance gain of the optimal weight over SA by optimizing the error variance-covariance structure, and points out that the gain is often small enough to be overshadowed by estimation error. Timmermann (2006) and Hsiao and Wan (2014) also illustrate conditions under which the optimal weight is close to the equal weight, so that the relative gain of the optimal weight over SA is small. Claeskens et al. (2014) consider random weights and show that when the weight variance is taken into account, SA can perform better than using the “optimal” weight. Under linear regression settings, Huang and Lee (2010) discuss the estimation error and the relative gain of the optimal weight.
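To make the estimation-error argument concrete, the sketch below (our own illustration, not a reproduction of Smith and Wallis (2009)) draws two forecast errors with equal variance and correlation 0.5, for which the true Bates and Granger weight on each forecast is exactly 0.5 by symmetry, and measures how widely the estimated weight scatters around that value for a short versus a long history. The sample sizes, correlation and function names are hypothetical choices.

```python
import numpy as np


def estimated_weight(e):
    """Two-forecast Bates-Granger weight on forecast 1 from an (n, 2) error sample."""
    s11 = np.mean(e[:, 0] ** 2)
    s22 = np.mean(e[:, 1] ** 2)
    s12 = np.mean(e[:, 0] * e[:, 1])
    return (s22 - s12) / (s11 + s22 - 2 * s12)


def weight_spread(n, n_rep=500, rho=0.5, seed=0):
    """Mean and standard deviation of the estimated weight over n_rep samples of size n."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    w = np.array([
        estimated_weight(rng.multivariate_normal([0.0, 0.0], cov, size=n))
        for _ in range(n_rep)
    ])
    return w.mean(), w.std()


for n in (20, 2000):
    mean_w, sd_w = weight_spread(n)
    print(f"n={n:5d}: mean weight {mean_w:.3f}, sd {sd_w:.3f}")
```

With a short history the estimated weight often lands far from 0.5, which is exactly where the equal weight already sits.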

In addition to estimation error, nonstationarity and structural breaks in the data generating process (DGP) are believed to contribute to the unstable performance of the estimated “optimal” weight. For example, Hendry and Clements (2004) demonstrate that when candidate forecasting models are all misspecified and breaks occur in the information variables, forecast combination methods that target the optimal weight may not perform as well as SA. Also, Huang and Lee (2010) argue that the candidate forecasts are often weak, that is, they have low predictive content for the target variable, making the optimal weight similar to the simple equal weight.

While the aforementioned points are valid and valuable, they do not depict the complete picture of the puzzle. In this paper, we provide our perspectives on FCP to help settle it. In our view, besides providing explanations of FCP, it is also very important to point out the potential danger of recommending SA for broad and indiscriminate use. Here, we focus on the mean squared error (MSE). The main points are expected to hold for other losses as well (e.g., absolute error), and some combination approaches (e.g., AFTER) can handle general loss functions.
The rest of this article is organized as follows. In section 2, we list some aspects that have not been much addressed but are, in our view, important for understanding FCP. We formally introduce the setup of the forecast combination problem we consider in section 3. Our understanding of FCP is elaborated starting in section 4. In particular, section 5 proposes a multi-level AFTER approach to solve FCP. The performance of this approach is also evaluated using U.S. Survey of Professional Forecasters (SPF) data. A brief conclusion is given in section 10.


2. Additional Aspects of FCP 


Previous work has clearly pointed out that estimation error is an important source of FCP and has characterized the impact of the estimation error in idealized settings. Indeed, in general, even when the forecast combination weighting formula is valid in the sense that an optimal weight can be correctly estimated by minimizing MSE, an insufficiently large sample may not support reliable estimation of the weight, resulting in inflated variance of the combined forecast. The explanation based on structural breaks also makes sense in certain situations. However, in our view, there are several additional aspects that need to be considered for understanding FCP.


1. A key factor missing from discussions of the FCP is the true nature of improvability of the candidate forecasts. While we all strive for better forecast performance than the candidates, that may not be feasible (at least for the methods considered). Thus we have two scenarios (Yang, 2004): i) One of the candidates is essentially the best we can hope for (within the considerations, of course), and consequently any attempt to beat it will not succeed. We refer to this scenario as “Combining for Adaptation” (CFA), because the proper goal of a forecast combination method under this scenario should be to target the performance of the best individual candidate forecast, which is unknown. ii) The other is that a significant gain in accuracy over all the individual candidates can be realized. We refer to this scenario as “Combining for Improvement” (CFI), because the proper goal of a forecast combination method under this scenario should be to target the performance of the best combination of the candidate forecasts to overcome defects of the candidates. In our experience, both scenarios occur commonly in real problems. Without factoring in this aspect, comparisons of different combination methods may be grossly misleading due to the well-known sin of comparing apples to oranges. In our view, future empirical studies on forecast combination need to bring this lurking aspect into the analysis. With the above forecast combination scenarios spelled out, a natural question follows: Can we design a combination method that bridges the two camps of methods proposed for the two scenarios, so as to help solve the FCP?

2. The methods being examined in the literature on FCP are mostly specific choices 
(e.g., least squares estimation). Can we do better with other methods (that may 
or may not have been invented yet) to avoid the heavy estimation price? Also, 
the currently investigated methods often assume the forecasts are unbiased and 
the forecast errors are stationary, which may not be proper for many applications. 
What happens when these assumptions fail? 

3. It has been stated in the literature, based on empirical studies, that simple methods (e.g., SA) are robust. We feel this is not necessarily true in the usual statistical sense (rigorously or loosely). In many published empirical results, the candidate forecasts were carefully selected/built and thus well-behaved. Therefore, the finding in favor of the robustness of SA may hold only in situations where the data analyst has extensive expertise on the forecasting problem and has done considerable work screening out poor/un-useful candidates. We argue that it is much more desirable to investigate FCP broadly, allowing for poor/redundant candidates, for wider and more realistic applications. It should be added that in various situations, the screening of forecasts is far from an easy task, and its complexity may well be at the same level as model selection/averaging. Therefore, even for top experts, the view that we can do a good job in screening the candidate forecasts and then simply recruit SA is overly optimistic. Given the above, an important matter is to examine the robustness of SA in a broader context.
As described in the first item, there are two distinct scenarios: CFA and CFI. The CFA scenario can happen if one of the candidate forecasts is based on a model sophisticated enough to capture the true DGP (yet still relatively simple), and/or the other candidate forecasts only add redundant information. The CFI scenario often happens when different candidate forecasts use different information, and/or their underlying models are misspecified in different ways.

There are different existing combining methods designed for the two scenarios. The methods for the CFI scenario typically seek to estimate the optimal weight aggressively; examples include variance-covariance based optimization (Bates and Granger, 1969) and linear regression (Granger and Ramanathan, 1984). These methods are likely to suffer from estimation error, causing unstable performance relative to SA. On the other hand, combining methods for the CFA scenario should ideally perform similarly to the best individual candidate forecast and should not be as severely subject to estimation error as the methods for CFI. Typical methods suitable for the CFA scenario include AIC model averaging (Buckland et al., 1997) and Bayesian model averaging (e.g., Garratt et al., 2003), both in parametric settings. The AFTER method (Yang, 2004) can be applied more broadly, in both parametric and non-parametric settings, regardless of the nature of the candidate forecasts. As one of the main contributions of this article, we show that the distinction between the two scenarios provides one of the keys to understanding the FCP. We will see in section 4 that an analyst who fails to understand and account for the underlying scenario and the specific type of data when choosing a combining method can incorrectly apply a method not designed for that scenario, and consequently deliver forecasting results worse than other methods (e.g., SA).

For the questions raised in the second item regarding whether we can avoid the estimation price, we cannot fully address them without a proper framework, because for any sensible method, one can always find a situation that favors it over its competitors. The framework we consider, which has sound theoretical support, is a minimax view: if one has a specific class of combinations of the forecasts in mind and wants to target the best combination in this class, then without any restriction/assumption on unbiasedness of the candidate forecasts or stationarity of the forecast errors, the minimax view seeks a clear understanding of the minimum price we have to pay no matter what method (existing or not) is used for combining. It turns out that this minimax framework is closely related to the forecast combination scenarios discussed in the first item, and Yang (2004) provides a detailed theoretical exposition of the distinct forecast combination scenarios and the associated minimax results.


Indeed, Yang (2004) shows that from a minimax perspective, because of the aggressive target set for the CFI scenario, we have to pay an unavoidably heavier cost than for the target set under the CFA scenario. Specifically, if we let $K$ denote the number of forecasts and $T$ denote the forecasting horizon, Yang (2004) shows that when the target is to find the optimal weight minimizing the general empirical risk over a set of weights satisfying a convex constraint (which is appropriate under the CFI scenario), the estimation cost is $O\left(\frac{K\log(T/K)}{T}\right)$ for relatively large $T$ ($T > K^2$), and $O\left(\sqrt{\frac{\log(K)}{T}\log T}\right)$ for relatively small $T$ ($T < K^2$). In contrast, if the target is to match the performance of the best individual forecast (which is appropriate under the CFA scenario), the estimation cost is only $O(\log(K)/T)$.

Because of the unavoidable heavy cost under the CFI scenario, it is not always ideal to pursue the aggressive target of the optimal weight. Indeed, even if the optimal weight gives better performance than the best individual candidate, the improvement may not be enough to offset the additional estimation cost (i.e., increased variance), as precisely identified (in minimax rate) in Yang (2004) and Wang et al. (2014). As another contribution of our work, we show in section 5 that an appropriately constructed forecast combination strategy can adapt intelligently to the underlying CFI or CFA scenario. If CFI is the correct scenario, the proposed strategy can behave both aggressively and conservatively, so that it performs similarly to SA when SA is much better than, e.g., the linear regression method.

Besides the estimation error and the necessary distinction of underlying scenarios discussed in the first two items, the following three reasons can also contribute to FCP. First, the weighting formula used by complicated methods is often not suitable for the situation. For example, under structural breaks, old historical data no longer support a valid optimal weighting scheme, and the known justification of well-established combining methods fails as a result. Indeed, Hendry and Clements (2004) demonstrate that when candidate forecasting models are all misspecified and breaks occur in the information variables, methods that estimate the optimal weight may not perform as well as SA. In a later section, our Monte Carlo examples also show that SA may dominate the complicated methods when breaks occur in the DGP dynamics. Second, it is common practice that the candidate forecasts are already screened in some way, so that they are more or less on an equal footing. For example, Stock and Watson (1998) and Stock and Watson (2004) apply various model selection methods such as AIC and BIC to identify promising linear or nonlinear candidate forecast models. More recently, Bordignon et al. (2013) select models of different types (ARMAX, time-varying coefficients, etc.) and suggest that SA works well when combining a small number of well-performing forecasts. In studies using survey data of professional forecasters, it is also expected that each professional forecaster performs some model screening before settling on their own forecast. In these cases, there may be no particularly poor candidate forecasts, and the candidates (at least the top ones) may tend to contribute more or less equally to the optimal combination, making SA a competitive method. We also use Monte Carlo examples in a later section to show that screening can be a source of FCP. Lastly, the puzzle can also be a result of publication bias: researchers tend not to emphasize the performance of SA when SA does not work well.

With the understanding of FCP discussed above, we address the issues raised in the third item and provide further information on the robustness of SA in later sections. In particular, we will see that SA is actually not robust in performance in several directions: its performance may change significantly, even substantially, when i) an optimal, poor or redundant forecast is added; or ii) the screening of the candidate forecasts is done to a different degree. In addition, the size of the rolling window used to deal with structural breaks affects the relative performance of SA as well. Fortunately, as will be seen, some combination methods can largely avoid these defects.















3. Problem Setup 


Suppose that an analyst is interested in forecasting a real-valued time series $y_1, y_2, \ldots$. Given each time point $t \ge 1$, let $\mathbf{x}_t$ be the (possibly multivariate) information variable vector revealed prior to the observation of $y_t$. The $\mathbf{x}_t$ may not be accessible to the analyst. Conditional on $\mathbf{x}_t$ and $\mathbf{z}_{t-1} := \{(\mathbf{x}_j, y_j), 1 \le j \le t-1\}$, $y_t$ is subsequently generated from some unknown distribution $p_t(\cdot \mid \mathbf{x}_t, \mathbf{z}_{t-1})$ with conditional mean $m_t = E(y_t \mid \mathbf{x}_t, \mathbf{z}_{t-1})$ and conditional variance $v_t = \mathrm{Var}(y_t \mid \mathbf{x}_t, \mathbf{z}_{t-1})$. Then $y_t$ can be represented as $y_t = m_t + e_t$, where $e_t$ is the random noise with conditional mean and conditional variance $0$ and $v_t$, respectively.

Assume that prior to the observation of $y_t$, the analyst has access to $K$ real-valued candidate forecasts $\hat{y}_{t,i}$ ($i = 1, \ldots, K$). These forecasts may be constructed with different model structures and/or with different components of the information variables, but the details of how each original forecast is created may not be available in practice and are not assumed to be known. The analyst's objective in (linear) forecast combination is to construct a weight vector $\mathbf{w} = (w_1, \ldots, w_K)^\top \in \mathbb{R}^K$, based on the information available prior to the observation of $y_t$, and to find a point forecast of $y_t$ by the forecast combination $\hat{y}_{t,\mathbf{w}} = \sum_{i=1}^{K} w_i \hat{y}_{t,i}$. The weight vector may be different at different time points.

To gauge the performance of a procedure that produces forecasts $\{\hat{y}_t, t = 1, 2, \ldots\}$ given time horizon $T$, we consider the average forecast risk

$$R_T = \frac{1}{T}\sum_{t=1}^{T} E(y_t - \hat{y}_t)^2$$

in our analysis and simulation studies. For real data evaluation, since the risk cannot be computed, we use the mean square forecast error (MSFE) as a substitute:

$$\mathrm{MSFE}_T = \frac{1}{T}\sum_{t=1}^{T} (y_t - \hat{y}_t)^2.$$

According to the FCP, simple methods with little or no time variation in the weight $\mathbf{w}$ (e.g., equal weighting) often outperform complicated methods with much time variation in terms of $R_T$ and $\mathrm{MSFE}_T$.
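The average forecast risk can only be approximated by simulation, but the combined forecast $\hat{y}_{t,\mathbf{w}}$, $\mathrm{MSFE}_T$, and the normalization by SA used later in the simulations are straightforward to compute. The sketch below assumes the $K$ candidate forecasts are stored as a $T \times K$ array; the function names are hypothetical.

```python
import numpy as np


def combine(forecasts, weights):
    """yhat_{t,w} = sum_i w_{t,i} yhat_{t,i}; weights may be (K,) or time-varying (T, K)."""
    return np.sum(np.asarray(forecasts, float) * np.asarray(weights, float), axis=1)


def msfe(y, y_hat):
    """MSFE_T = (1/T) sum_t (y_t - yhat_t)^2."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))


def normalized_msfe(y, combined, forecasts):
    """MSFE of a combined forecast divided by that of the simple average (SA)."""
    sa = np.asarray(forecasts, float).mean(axis=1)
    return msfe(y, combined) / msfe(y, sa)


y = np.array([1.0, 2.0, 3.0])
F = np.array([[1.0, 3.0], [2.0, 2.0], [3.0, 5.0]])
print(msfe(y, F.mean(axis=1)))                         # SA errors are -1, 0, -1
print(normalized_msfe(y, combine(F, [1.0, 0.0]), F))  # all weight on forecast 1
```

A normalized value below 1 means the combination beats SA on that sample; passing a $(T, K)$ weight array lets the same `combine` call handle time-varying weights.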
4. CFA versus CFI: A Hidden Source of FCP


In this section, we study the performance of forecast combination methods under the two distinct scenarios. Failure to recognize these scenarios can itself result in the FCP. We use two simple but illustrative Monte Carlo examples under regression settings similar to those of Huang and Lee (2010) to demonstrate the CFA and CFI scenarios.


Case 1. Suppose $y_t$ ($t = 1, \ldots, T$) is generated by the linear model

$$y_t = x_t\beta + \varepsilon_t,$$

where the $x_t$'s are i.i.d. $N(0, \sigma_x^2)$, and the $\varepsilon_t$'s are independent of the $x_t$'s and are i.i.d. $N(0, \sigma^2)$. Consider the two candidate forecasts generated by

Forecast 1: $\hat{y}_{t,1} = x_t\hat{\beta}_t$;

Forecast 2: $\hat{y}_{t,2} = \hat{\alpha}_t$,

where $\hat{\beta}_t$ and $\hat{\alpha}_t$ are both obtained from ordinary least squares (OLS) estimation using historical data.


Given that Forecast 1 essentially represents the true model, combining it with Forecast 2 cannot improve over the performance of the best individual forecast asymptotically, which gives an example of the CFA scenario. Let $T_0$ be a fixed start point of the evaluation period, and let $T$ be the end point. Given the evaluation period from $T_0$ to $T$, let $R_{T,1}$, $R_{T,2}$ and $R_{T,\mathbf{w}}$ be the average forecast risks of Forecast 1, Forecast 2 and the combined forecast, respectively. If we let $R_{T,SA}$ be the average forecast risk at time $T$ for SA, we expect that $R_{T,SA} > R_{T,1}$. Indeed, a proposition in the Appendix shows

$$\frac{R_{T,1}}{R_{T,SA}} \to \frac{\sigma^2}{\sigma^2 + \beta^2\sigma_x^2/4} \quad \text{as } T \to \infty, \tag{1}$$

and asymptotically, the optimal combination assigns all the weight to Forecast 1.

Under the CFA scenario, since the best candidate is unknown, the natural goal of 
forecast combination is to match the performance of the best candidate. 
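The limit in (1) can be checked numerically. The sketch below is our own large-sample approximation rather than the paper's simulation design: it fits the candidates once on a long training sample instead of re-estimating them at every $t$, and it takes Forecast 2 to be the intercept-only fit $\hat\alpha_t$, which is the reading consistent with the limit in (1). The sample sizes and function names are hypothetical.

```python
import numpy as np


def case1_ratio(beta, sigma_x=1.0, sigma=1.0, n_train=20_000, n_test=20_000, seed=0):
    """Empirical R_{T,1} / R_{T,SA} for Case 1 using one large train/test split."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    x = rng.normal(0.0, sigma_x, n)
    y = x * beta + rng.normal(0.0, sigma, n)
    x_tr, y_tr, x_te, y_te = x[:n_train], y[:n_train], x[n_train:], y[n_train:]
    beta_hat = (x_tr @ y_tr) / (x_tr @ x_tr)   # Forecast 1: OLS slope, no intercept
    alpha_hat = y_tr.mean()                     # Forecast 2: intercept-only OLS
    f1 = x_te * beta_hat
    sa = (f1 + alpha_hat) / 2                   # simple average of the two forecasts
    return np.mean((y_te - f1) ** 2) / np.mean((y_te - sa) ** 2)


def case1_limit(beta, sigma_x=1.0, sigma=1.0):
    """Right-hand side of (1): sigma^2 / (sigma^2 + beta^2 sigma_x^2 / 4)."""
    return sigma**2 / (sigma**2 + beta**2 * sigma_x**2 / 4)


for beta in (0.5, 1.0, 2.0):
    print(beta, round(case1_ratio(beta), 3), round(case1_limit(beta), 3))
```

The ratio falls as $\beta$ grows: the stronger the signal that Forecast 2 ignores, the more SA loses relative to the best candidate.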
Case 2. Suppose $y_t$ ($t = 1, \ldots, T$) is generated by the linear model

$$y_t = (x_{t,1} + x_{t,2})\beta + \varepsilon_t,$$

where the $\mathbf{x}_t = (x_{t,1}, x_{t,2})^\top$ are i.i.d. following a bivariate normal distribution with mean $\mathbf{0}$ and common variance $\sigma_x^2$. Let $\rho$ denote the correlation between $x_{t,1}$ and $x_{t,2}$. The random errors $\varepsilon_t$ are independent of the $\mathbf{x}_t$'s and are i.i.d. $N(0, \sigma^2)$. Consider the two candidate forecasts generated by

Forecast 1: $\hat{y}_{t,1} = x_{t,1}\hat{\beta}_{t,1}$;

Forecast 2: $\hat{y}_{t,2} = x_{t,2}\hat{\beta}_{t,2}$,

where $\hat{\beta}_{t,1}$ and $\hat{\beta}_{t,2}$ are both obtained from OLS estimation with historical data.


Unlike Case 1, Case 2 presents a scenario where each candidate forecast employs only part of the information set. It is expected, to some extent, that combining the two forecasts works like pooling different sources of important information, resulting in performance better than either of the candidate forecasts. By defining the average forecast risks $R_{T,1}$, $R_{T,2}$ and $R_{T,SA}$ the same way as in Case 1, we can see from a proposition in the Appendix that

$$\frac{R_{T,1}}{R_{T,SA}} \to \frac{\beta^2\sigma_x^2(1-\rho^2) + \sigma^2}{\beta^2\sigma_x^2(1-\rho^2)(1-\rho)/2 + \sigma^2} \quad \text{as } T \to \infty. \tag{2}$$

Clearly, when the two information sets are not highly correlated, SA can improve the 
forecast performance over the best candidate. This case gives a typical example of the 
CFI scenario, and it is appropriate to seek the more aggressive goal of finding the best
linear combination of candidate forecasts. 
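The same kind of large-sample check applies to (2). In this sketch (again our own approximation with hypothetical names and sample sizes, not the paper's code), each candidate regresses $y$ on a single information variable without an intercept, and the empirical ratio is compared with the right-hand side of (2).

```python
import numpy as np


def case2_ratio(beta, rho, sigma_x=1.0, sigma=1.0, n_train=20_000, n_test=20_000, seed=0):
    """Empirical R_{T,1} / R_{T,SA} for Case 2 using one large train/test split."""
    rng = np.random.default_rng(seed)
    n = n_train + n_test
    cov = sigma_x**2 * np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = (x[:, 0] + x[:, 1]) * beta + rng.normal(0.0, sigma, n)
    tr, te = slice(0, n_train), slice(n_train, n)
    # Each candidate uses only one information variable (OLS without intercept).
    b1 = (x[tr, 0] @ y[tr]) / (x[tr, 0] @ x[tr, 0])
    b2 = (x[tr, 1] @ y[tr]) / (x[tr, 1] @ x[tr, 1])
    f1, f2 = x[te, 0] * b1, x[te, 1] * b2
    sa = (f1 + f2) / 2
    return np.mean((y[te] - f1) ** 2) / np.mean((y[te] - sa) ** 2)


def case2_limit(beta, rho, sigma_x=1.0, sigma=1.0):
    """Right-hand side of (2)."""
    s = beta**2 * sigma_x**2 * (1 - rho**2)
    return (s + sigma**2) / (s * (1 - rho) / 2 + sigma**2)


for rho in (0.0, 0.5):
    print(rho, round(case2_ratio(1.0, rho), 3), round(case2_limit(1.0, rho), 3))
```

Values above 1 mean SA beats the better individual candidate, which is the CFI signature.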

Our view is that discussion of the FCP should take into account the different combining scenarios. Next, we perform Monte Carlo studies on the two cases to provide an explanation of the puzzle. Combining methods suitable for the CFA scenario have been developed to target the performance of the best individual candidate. In our numerical studies, we choose the AFTER method (Yang, 2004) as the representative; it is known that AFTER pays a smaller estimation price than methods that target the optimal linear or convex weighting. In contrast, combining methods for the CFI scenario usually attempt to estimate the optimal weight. We choose linear regression of the response on the candidate forecasts (LinReg) as the representative. The method of Bates and Granger (1969) without estimating correlation (BG for brevity) is used as an additional benchmark.
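The three representative combining methods can be sketched as follows. These are simplified versions written for illustration, not the implementations used in the paper: the AFTER-style update assumes Gaussian errors and estimates each candidate's scale by the root mean square of its own earlier errors, BG uses inverse mean squared errors (ignoring correlation), and LinReg regresses $y$ on an intercept and the candidate forecasts, falling back to SA until enough history is available. In every case, the weights at time $t$ use only data observed before $t$. The function names and the `min_hist` parameter are hypothetical.

```python
import numpy as np


def after_weights(y, forecasts, min_hist=2):
    """Simplified AFTER-style weights; row t uses only y[:t] and forecasts[:t]."""
    y, F = np.asarray(y, float), np.asarray(forecasts, float)
    T, K = F.shape
    W = np.full((T, K), 1.0 / K)               # equal initial weights
    log_w = np.zeros(K)                        # log of unnormalized weights
    for t in range(1, T):
        if t - 1 >= min_hist:
            # Scale of each candidate from its errors before time t-1 (assumption).
            past = y[:t - 1, None] - F[:t - 1]
            sd = np.maximum(np.sqrt(np.mean(past ** 2, axis=0)), 1e-8)
            err = y[t - 1] - F[t - 1]
            log_w += -np.log(sd) - err ** 2 / (2 * sd ** 2)  # Gaussian log-likelihood
        w = np.exp(log_w - log_w.max())        # subtract max for numerical stability
        W[t] = w / w.sum()
    return W


def bg_weights(y, forecasts, min_hist=2):
    """Bates-Granger weights without correlation: proportional to 1 / past MSE."""
    y, F = np.asarray(y, float), np.asarray(forecasts, float)
    T, K = F.shape
    W = np.full((T, K), 1.0 / K)
    for t in range(min_hist, T):
        mse = np.mean((y[:t, None] - F[:t]) ** 2, axis=0)
        inv = 1.0 / np.maximum(mse, 1e-12)
        W[t] = inv / inv.sum()
    return W


def linreg_forecasts(y, forecasts, min_hist=None):
    """LinReg combination: OLS of y on an intercept and the forecasts, refit at each t."""
    y, F = np.asarray(y, float), np.asarray(forecasts, float)
    T, K = F.shape
    min_hist = K + 2 if min_hist is None else min_hist
    out = F.mean(axis=1)                       # SA until enough history (assumption)
    for t in range(min_hist, T):
        X = np.column_stack([np.ones(t), F[:t]])
        coef, *_ = np.linalg.lstsq(X, y[:t], rcond=None)
        out[t] = coef[0] + F[t] @ coef[1:]
    return out


rng = np.random.default_rng(1)
y = rng.normal(size=200)
F = np.column_stack([y + 0.1 * rng.normal(size=200), y + rng.normal(size=200)])
print(after_weights(y, F)[-1], bg_weights(y, F)[-1])
```

Combined forecasts for AFTER and BG then follow as `np.sum(W * F, axis=1)`. LinReg fits $K + 1$ coefficients at every step, which is where its extra estimation error comes from when the history is short or the signal is weak.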

For Case 1, we perform simulations as follows. Set $\sigma_x^2 = \sigma^2 = 1$. Consider a sequence of 20 $\beta$'s such that the corresponding signal-to-noise (S/N) ratios are evenly spaced between 0.05 and 5 on the logarithmic scale. For each $\beta$, we conduct the following simulation 100 times to estimate the average forecast risk. A sample of 100 observations is generated. The first 60 observations are used to build the candidate forecast models, which are subsequently used to generate forecasts for the remaining 40 observations. Forecast combination methods including SA, BG, AFTER and LinReg are applied to combine the candidate forecasts, and the last 20 observations are used for performance evaluation. The average forecast risk of each forecast combination method is divided by that of SA to obtain the normalized average forecast risk (denoted by normalized $R_T$). The results are summarized in Figure 1. For Case 2, we set $\beta = \beta_1 = \beta_2$, $\rho = 0$ and $\sigma_x^2 = \sigma_{x_1}^2 = \sigma_{x_2}^2 = 1$. The remaining simulation settings are the same as in Case 1. The normalized average forecast risks (relative to SA) are summarized in Figure 2.
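The simulation grid and data split described above can be written down explicitly. The sketch assumes that S/N is the variance of the signal divided by the noise variance $\sigma^2$, which gives $\beta^2\sigma_x^2/\sigma^2$ in Case 1 and, with $\rho = 0$, $2\beta^2\sigma_x^2/\sigma^2$ in Case 2; the paper's exact definition may differ. Variable names are hypothetical.

```python
import numpy as np

sigma2, sigma2_x = 1.0, 1.0
# 20 S/N values evenly spaced on the log scale between 0.05 and 5.
snr = np.logspace(np.log10(0.05), np.log10(5.0), 20)
beta_case1 = np.sqrt(snr * sigma2 / sigma2_x)        # assumes S/N = beta^2 sigma_x^2 / sigma^2
beta_case2 = np.sqrt(snr * sigma2 / (2 * sigma2_x))  # assumes S/N = 2 beta^2 sigma_x^2 / sigma^2 (rho = 0)

# Index ranges (0-based) for one replication of 100 observations.
n_total, n_build, n_eval = 100, 60, 20
build = np.arange(0, n_build)                     # fit the candidate models
forecast = np.arange(n_build, n_total)            # candidate forecasts are produced here
evaluate = np.arange(n_total - n_eval, n_total)   # last 20 points used for evaluation
```

The 20 forecasts between the build and evaluation windows serve as the history from which the combining methods learn their weights.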

In Case 1, it is clear from Figure 1 that AFTER is the preferred method under the CFA scenario. LinReg, on the other hand, consistently underperforms AFTER. Interestingly, when S/N is relatively low (less than 0.35), we observe the “puzzle” that LinReg performs worse than SA, which is due to the weight estimation error. If the analyst correctly identifies that it is the CFA scenario and applies a corresponding method like AFTER, the “puzzle” disappears: AFTER can perform better than (or very close to) SA, while LinReg fails.

Figure 1: (Case 1) Comparing the average forecast risk of different forecast combination methods (dashed line represents the SA baseline; x-axis is in logarithmic scale).

In Case 2, if the analyst applies AFTER without realizing the underlying CFI scenario, we observe the “puzzle” that SA outperforms AFTER. The “puzzle” is not entirely surprising since AFTER is designed to target the performance of the best individual forecast, while (2) shows that SA can improve over the best individual forecast. LinReg appears to be the correct method of choice when the S/N ratio is relatively high. However, similar to what is observed in Case 1, LinReg suffers from weight estimation error when the S/N ratio is low, once again giving the “puzzle” that LinReg performs worse than SA. Case 2 also illustrates that it is not always optimal to apply SA, even when SA is the “optimal” weight in a restricted sense. Indeed, (A.2) and (A.3) in the Appendix imply that if we adopt the common restriction that the weights sum to 1, SA is the asymptotically optimal weight. However, if we impose no restriction on the weight range, the asymptotically optimal weight assigns a unit weight to each candidate forecast. This explains the advantage of LinReg over SA in Case 2 when the S/N ratio is large.

Figure 2: (Case 2) Comparing the average forecast risk of different forecast combination methods (dashed line represents the SA baseline; x-axis is in logarithmic scale).

The observations above illustrate that different combining methods can have strikingly different performance depending on the underlying scenario. The FCP can appear when a combining method is not properly chosen according to the correct scenario. Without knowing the underlying scenario, comparing these methods may not provide a complete picture of FCP, and blindly applying SA may result in sub-optimal performance. We advocate the practice of trying to identify the underlying scenario (CFA or CFI) when considering forecast combination. It should be pointed out that when the relevant information is limited, it may not be feasible to confidently identify the forecast combination scenario. In such a case, a forced selection, similar to the comparison of model selection and model combining (averaging) described in Yuan and Yang (2005), would induce enlarged variability of the resulting forecast. A better solution is an adaptive combination of forecasts, as illustrated in the next section.


5. Multi-level AFTER 

With the understanding from section 4, we see that when considering forecast combination methods, an effort should be made to understand whether there is much room for improvement over the best candidate. When this is difficult to decide, or impractical because a large number of quantities must be forecast in real time, we may turn to the question: Can we find an adaptive (or universal) combining strategy that performs well in both the CFA and CFI scenarios? Note that here adaptive refers to adaptation to the forecast combination scenario (instead of adaptation to achieving the best individual performance). Another question follows: Under the CFI scenario, can the adaptive combining strategy still perform as well as SA when the price of estimation error is high? As we have seen in Case 2 of section 4, using methods intended for the CFI scenario alone (e.g., LinReg) cannot successfully address the second question.

It turns out that the answers to these two questions are affirmative. The idea is related to a philosophical comment in Clemen et al. (1995):

“Any combination of forecasts yields a single forecast. As a result, a particular combination of a given set of forecasts can itself be thought of as a forecasting method that could compete... ”

The use of combinations of forecast (or procedure) combinations is a theoretically powerful tool to achieve adaptive minimax optimality (see, e.g., Yang (2004), Wang et al. (2014)). In the context of our discussion, combined forecasts such as SA, AFTER and LinReg can all be considered candidate forecasts and may be used as individual candidates in a forecast combination scheme.

Accordingly, we design a two-step combining strategy: first, we construct three new candidate forecasts using SA, AFTER and LinReg; second, we apply the AFTER algorithm to these new candidate forecasts to generate a combined forecast. We refer to this two-step algorithm as multi-level AFTER (or mAFTER for short) because two layers of AFTER algorithms are involved. The key lies in the AFTER algorithm in the second step, which allows mAFTER to automatically target the performance of the best individual candidate among SA, AFTER and LinReg. Under the CFA scenario, mAFTER can perform as if we were using AFTER alone, since AFTER is the proper method of choice there. Under the CFI scenario, mAFTER can perform close to the better of SA and LinReg. Thus, when LinReg suffers from severe estimation error, mAFTER will perform close to SA and thereby avoid the high cost.
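The two-step construction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the AFTER weights use a simplified variance estimate (the running mean of each candidate's past squared errors, a unit variance at the first step, and a small floor), equal prior weights are used, and the LinReg forecasts are assumed to be computed elsewhere and passed in. The function names `after_weights` and `mafter` are hypothetical.

```python
import math
from statistics import fmean


def after_weights(past_errors, var_floor=1e-8):
    """Simplified AFTER weights from each candidate's past forecast errors.

    past_errors[i] lists candidate i's errors y_s - yhat_{s,i} for s < t.
    The log-weight accumulates -0.5*log(v_s) - e_s**2 / (2*v_s), where v_s
    (an assumption of this sketch) is the mean squared error before s,
    set to 1.0 at s = 0 and floored at var_floor afterwards.
    """
    log_w = []
    for errs in past_errors:
        lw, sq_sum = 0.0, 0.0
        for s, e in enumerate(errs):
            v = max(sq_sum / s, var_floor) if s > 0 else 1.0
            lw += -0.5 * math.log(v) - e * e / (2.0 * v)
            sq_sum += e * e
        log_w.append(lw)
    top = max(log_w)  # subtract the maximum before exponentiating, for stability
    w = [math.exp(x - top) for x in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def mafter(y, cands, linreg):
    """Two-level AFTER sketch.

    Level 1 builds SA and AFTER from the original candidates; level 2 runs
    AFTER on the SA, AFTER and LinReg forecasts. y[t] is the realized value,
    cands[i][t] is candidate i's forecast of y[t], and linreg[t] is a LinReg
    combined forecast of y[t] (assumed precomputed). Only errors from times
    s < t are used to form the forecast of y[t].
    """
    names = ["SA", "AFTER", "LinReg"]
    level1 = {"SA": [], "AFTER": [], "LinReg": list(linreg)}
    combined = []
    for t in range(len(y)):
        # Level 1: simple average and AFTER over the original candidates.
        level1["SA"].append(fmean(c[t] for c in cands))
        w = after_weights([[y[s] - c[s] for s in range(t)] for c in cands])
        level1["AFTER"].append(sum(wi * c[t] for wi, c in zip(w, cands)))
        # Level 2: AFTER over the three combined forecasts.
        v = after_weights([[y[s] - level1[n][s] for s in range(t)] for n in names])
        combined.append(sum(vi * level1[n][t] for vi, n in zip(v, names)))
    return combined
```

The second-level AFTER only has to choose among three forecasts, so the price it pays for that choice does not grow with the number K of original candidates; the SA and LinReg inputs could be swapped for other representative methods without changing the structure.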

Indeed, if we denote the forecasts generated from SA, LinReg and mAFTER by $\hat{y}_t^{\mathrm{SA}}$, $\hat{y}_t^{\mathrm{LR}}$ and $\hat{y}_t^{\mathrm{mA}}$, respectively, we have Proposition 1 below.


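Proposition 1. Under the regularity conditions shown in the Appendix, the average forecast risk of the mAFTER strategy satisfies

$$\frac{1}{T}\sum_{t=T_0}^{T} E\big(y_t - \hat{y}_t^{\mathrm{mA}}\big)^2 \le \min\Bigg( \inf_{1 \le i \le K} \frac{1}{T}\sum_{t=T_0}^{T} E\big(y_t - \hat{y}_{t,i}\big)^2 + \frac{C_1 \log(K)}{T},\ \frac{1}{T}\sum_{t=T_0}^{T} E\big(y_t - \hat{y}_t^{\mathrm{SA}}\big)^2 + \frac{C_2}{T},\ \frac{1}{T}\sum_{t=T_0}^{T} E\big(y_t - \hat{y}_t^{\mathrm{LR}}\big)^2 + \frac{C_2}{T} \Bigg),$$

where $C_1$ and $C_2$ are some positive constants not depending on the time horizon $T$.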


Proposition 1 is a consequence of Theorem 5 in Yang (2004). It indicates that, in terms of the average forecast risk, mAFTER can match the performance of the best original individual forecast, the SA forecast and the LinReg forecast (whichever is the best), with a relatively small price of order at most $\log(K)/T$.

To confirm that the mAFTER strategy can solve the "puzzles" illustrated in the previous section, we repeat the simulation studies of Case 1 and Case 2 and summarize the results in Figure 3 and Figure 4, respectively. In Case 1, it suffices to see that mAFTER correctly tracks the performance of AFTER. In Case 2, when S/N is relatively large (> 0.5), mAFTER takes advantage of the opportunity to improve over the original individual forecasts and performs very close to LinReg; when S/N is relatively small (< 0.5), mAFTER behaves very similarly to SA and successfully avoids the heavy estimation error suffered by LinReg. Therefore, rather than relying on SA, a "sophisticated" combining strategy like mAFTER can be an appealingly safe method that avoids the FCP.

Note that mAFTER is a rather general forecast combination strategy. In the first step of the strategy, analysts can choose their own way of generating new candidate forecasts (not necessarily restricted to AFTER and LinReg), as long as these include SA, representative methods for the CFA scenario, and representative methods for the CFI scenario. AFTER and LinReg are simply chosen in our study as convenient representatives. We also demonstrate the performance of the mAFTER strategy in the real data example in section 9.


6. Is SA Really Robust? 

The SA has been praised for being robustly among the top performers relative to other forecast combination methods. It is obvious that SA cannot be robust in the traditional statistical sense: even a single really bad candidate can make the combined forecast arbitrarily bad. A more interesting question is to assess the robustness of SA in practically relevant settings.

The previous two sections have shown that SA is not robust in terms of its relative performance when dealing with the two different scenarios. In this section, we show that SA is not robust even in a loose sense when new forecast candidates are added to the candidate pool, especially if the new candidates carry only information that is redundant with respect to the original candidate pool. In contrast, the AFTER-type combining methods can be rather robust against adding poor or redundant candidate forecasts. Here, we consider the following three cases.

Case 3. Suppose a new information variable $x_{t,3}$ has the same distribution as $x_{t,1}$ and is independent of $z_{t-1}$ and $(x_{t,1}, x_{t,2})$. A new candidate forecast $\hat{y}_{t,3} = x_{t,3}\hat{\beta}_{t,3}$ joins the candidate pool in Case 2, where $\hat{\beta}_{t,3}$ is obtained from OLS estimation with historical data.

Case 4. A new candidate forecast $\hat{y}_{t,3} = x_{t,2}\hat{\beta}_{t,2}$, identical to Forecast 2, joins the candidate pool in Case 2.

Case 5. A new candidate forecast $\hat{y}_{t,3} = \tilde{x}_{t,2}\tilde{\beta}_{t,2}$ is generated using a transformed information variable $\tilde{x}_{t,2} = \exp(x_{t,2})$, where $\tilde{\beta}_{t,2}$ is obtained from OLS estimation with historical data.

Figure 3: (Case 1) Performance of mAFTER under adaptation scenario (dashed line represents the SA baseline; x-axis is in logarithmic scale).

Figure 4: (Case 2) Performance of mAFTER under improvement scenario (dashed line represents the SA baseline; x-axis is in logarithmic scale).

Note that the new candidate in Case 3 is a very poor forecast, while the new candidates in Case 4 and Case 5 contain a subset of the information variables. In all of the cases above, no new information is added to the candidate pool. Following the same simulation setting as Case 2, we focus on SA and AFTER and compute the ratio between the MSFE after adding the new candidate and the MSFE in Case 2. Figure 5 shows that the performance of AFTER remains almost the same, while the performance of SA worsens after adding the non-informative or redundant candidate forecasts.
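The contrast in Case 4 can be illustrated through effective weights. The sketch below is a simplification and not the simulation behind Figure 5: it uses an AFTER-type rule with a known, common noise variance, $w_i \propto \exp(-S_i/(2\sigma^2))$ with $S_i$ the cumulative squared error of candidate $i$, and the cumulative errors are made-up illustrative numbers. Duplicating Forecast 2 moves SA's weight on it from 1/2 to 2/3 no matter how it has performed, whereas for the AFTER-type rule the duplicate only doubles a prior factor that is negligible next to the exponential likelihood term. The helper names are hypothetical.

```python
import math


def sa_weights(n):
    """Simple average: every candidate gets weight 1/n."""
    return [1.0 / n] * n


def after_weights_known_var(cum_sq_errors, sigma2):
    """Simplified AFTER-type weights, w_i proportional to exp(-S_i / (2*sigma2))."""
    best = min(cum_sq_errors)  # shift by the minimum for numerical stability
    w = [math.exp(-(s - best) / (2.0 * sigma2)) for s in cum_sq_errors]
    total = sum(w)
    return [wi / total for wi in w]


def weight_on(label, weights, labels):
    """Total weight placed on candidates that are copies of a given forecast."""
    return sum(w for w, lab in zip(weights, labels) if lab == label)


# Hypothetical cumulative squared errors: Forecast 1 has been more accurate.
S1, S2, sigma2 = 50.0, 80.0, 1.0
pools = {
    "original": ([S1, S2], [1, 2]),
    "with duplicate": ([S1, S2, S2], [1, 2, 2]),  # Case 4: Forecast 2 copied
}
for name, (cum, labels) in pools.items():
    sa = weight_on(2, sa_weights(len(cum)), labels)
    af = weight_on(2, after_weights_known_var(cum, sigma2), labels)
    print(f"{name:15s} SA weight on Forecast 2: {sa:.3f}  AFTER-type: {af:.2e}")
```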

7. Improper Weighting Formulas: A Source of the FCP Revisited 

Generally speaking, the popular forecast combination methods often implicitly assume that the time series and/or the forecast errors are stationary. In theory, they are expected to perform well if we have access to long enough historical data. In practice, however, such derived weighting formulas can often be unsuitable when the DGP changes and the candidate forecasts cannot adjust quickly to the new reality. For example, it is often believed that structural breaks can happen unexpectedly, making the relative performance of the candidate forecasts unstable and giving us the impression that SA performs well.

Figure 5: Studying the robustness of SA against adding new candidate forecasts.

Next, we use a Monte Carlo example to illustrate the FCP under structural breaks. Rather than assuming deterministic shifts in information variables (Hendry and Clements, 2004), we consider breaks in the DGP dynamics:

$$y_t = \begin{cases} \sum_{k=1}^{\cdots} \beta_{1,k}\, y_{t-k} + \varepsilon_t, & 1 \le t \le 50,\\ \beta_{2,1}\, y_{t-1} + \beta_{2,2}\, y_{t-2} + \varepsilon_t, & 51 \le t \le 100,\\ \beta_{3,1}\, y_{t-1} + \varepsilon_t, & 101 \le t \le 150, \end{cases}$$

where the coefficients $\beta_{j,k}$ ($j = 1, 2, 3$) are randomly generated from the uniform distribution on $(0, 1)$, and the $\varepsilon_t$'s are i.i.d. $N(0, 1)$. Here, structural breaks happen at $t = 50$ and $t = 100$. The candidate forecast models are autoregressions from lag 1 to lag 6, and we apply SA, BG, LinReg and AFTER to generate the combined forecasts. The simulation is repeated 100 times, and the last 100 time points serve as the evaluation period to obtain the average forecast risk. For comparison, we also consider BG, LinReg and AFTER with an estimation rolling window of size rw = 20 or 40, meaning that only the most recent rw observations are used to estimate the weights for each forecast. The results are summarized in Table 1. The average forecast risk is normalized with respect to SA, and numbers in parentheses are standard errors.
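To make the role of rw concrete, here is a sketch of rolling-window weights for BG. It assumes BG is implemented with the common inverse-MSFE form of the Bates-Granger weights (ignoring correlations between forecast errors), which may differ from the exact variant used in the study; `bg_weights` and the error sequences are hypothetical.

```python
def bg_weights(past_errors, rw=None, floor=1e-12):
    """Inverse-MSFE (Bates-Granger style) combination weights.

    past_errors[i] holds candidate i's forecast errors, oldest first.
    If rw is given, only the most recent rw errors are used (rolling window);
    otherwise all historical errors are used (the "standard" setting).
    """
    inv_msfe = []
    for errs in past_errors:
        recent = errs[-rw:] if rw else errs
        msfe = sum(e * e for e in recent) / len(recent)
        inv_msfe.append(1.0 / max(msfe, floor))  # floor guards against zero MSFE
    total = sum(inv_msfe)
    return [v / total for v in inv_msfe]


# A break at t = 80: candidate A was accurate before it, candidate B after it.
errs_a = [0.1] * 80 + [2.0] * 20
errs_b = [2.0] * 80 + [0.1] * 20
print(bg_weights([errs_a, errs_b]))         # full history: about 0.80 on A
print(bg_weights([errs_a, errs_b], rw=20))  # window of 20: nearly all weight on B
```

Shrinking rw removes the bias from pre-break data but bases each weight on fewer errors; with only a handful of inverse-MSFE weights the first effect tends to win, which is in line with the BG rows of Table 1.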

Table 1: Comparing the normalized average forecast risk of different combination methods under structural breaks.

             SA       LinReg          BG              AFTER
standard     1.000    1.026 (0.011)   1.005 (0.003)   1.047 (0.010)
rw = 40      1.000    1.060 (0.033)   0.992 (0.002)   0.991 (0.009)
rw = 20      1.000    1.64 (0.42)     0.980 (0.003)   0.952 (0.007)


We can see from Table 1 that all three standard combining methods, when finding weights using all historical data, underperform SA due to the unstable relative performance of the candidate forecasts. As we shrink the estimation window to the most recent 40 and 20 time points, BG and AFTER achieve better performance than SA, while the performance of LinReg worsens. This result can be understood by noting that there are two opposing factors when we shrink the weight estimation window. When using only the most recent forecasts, we decrease the bias of a weighting formula supported by old data but simultaneously increase the variance of the estimated weights. Among the three methods considered, the estimation error factor dominates for LinReg. On the other hand, AFTER is not designed to aggressively target the optimal weight, and thus benefits the most from the shrinking rolling window.

Due to the complex impact of structural breaks on forecast combination methods, it is arguably true that the focus should be on how to detect the problem (see, e.g., Altissimo and Corradi, 2003; Davis et al., 2006) and how to come up with new combining forms accordingly (e.g., using the most recent observations to avoid an improper weighting formula). However, proper identification of structural breaks can be difficult to achieve in practice, and this example shows that in the presence of structural breaks, the relative performance of SA is not as robust as that of BG and AFTER with naively chosen rolling windows.


8. Linking Forecast Model Screening to FCP 


In empirical studies, the candidate forecasting models are often screened/selected in some way to generate a smaller set of candidates for combining. As demonstrated in Case 3 of section 6, the performance of SA is particularly susceptible to poor-performing candidate models. The common practice of model screening may therefore contribute to improving the performance of SA.

Next, we illustrate the impact of screening with a Monte Carlo example. Let $x_t \in \mathbb{R}^p$ ($p = 20$) be the $p$-dimensional information variable vector randomly generated from a multivariate normal distribution with mean 0 and covariance $\Sigma$, where $(\Sigma)_{ij} = \rho^{|i-j|}$ and $\rho = 0$ or $0.5$. Consider a DGP with the linear model setting

$$y_t = x_t^\top \beta + \varepsilon_t,$$

where the coefficient $\beta = (3, 3, 2, 1, 1, 1, 1, 0, 0, \ldots, 0)^\top$ and the $\varepsilon_t$ are i.i.d. $N(0, \sigma^2)$ with $\sigma = 2$ or $4$. Under this setting, only the first 7 variables in $x_t$ are important for $y_t$, while the remaining variables are redundant.

If we assume that the analyst has full access to the information vectors $x_t$, we may build linear models as the candidate forecasts with any subset of the information variables. It is known from Wang et al. (2014) that if we select the best subset model with the right size using the ABC criterion (Yang, 1999) or combine the subset regression models by proper adaptive combining methods (Yang, 2001), the prediction risk can adaptively achieve minimax optimality over soft and hard sparse function classes. Inspired by this result, we consider the following screening-and-combining approach.
First, given the model size (that is, the number of information variables used in a candidate linear model), choose the best OLS model based on the estimation mean square error. Second, from the $p$ models selected in the first step, find the top X% (X = 10, 20, 40, 60, 80) of the models based on the ABC criterion. Note that the ABC criterion for a subset model with size $r$ is

$$\mathrm{ABC}(r) = \sum_{t=1}^{n} \big(y_t - \hat{y}_{t,r}\big)^2 + 2r\sigma^2 + \sigma^2 \log\binom{p}{r},$$

where $n$ is the estimation sample size, $\hat{y}_{t,r}$ is the fitted response, and $\sigma^2$ can be replaced by the estimation mean square error. The subset models remaining after the two-step screening are used to build the candidate forecasts for combining. In the simulation, the total time horizon is set to 200. The screening procedures are applied to the first 100 observations, and the remaining models are used to build the candidate forecasts for the latter 100 time points. Different forecast combination methods are applied, and their performances are evaluated using the last 50 observations. The simulation is repeated 100 times, and the normalized average forecast risk (relative to SA) is summarized in Table 2.
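The second screening step can be sketched as follows. The sketch takes as given the residual sum of squares of the best OLS model of each size $r = 1, \ldots, p$ from the first step (fitting those models is omitted), uses the ABC form given above, and keeps round(X% of $p$) model sizes; the rounding rule, the function names and the RSS values are assumptions made for illustration.

```python
import math


def abc(rss, r, p, sigma2):
    """ABC(r) = RSS_r + 2*r*sigma2 + sigma2*log(C(p, r))."""
    return rss + 2 * r * sigma2 + sigma2 * math.log(math.comb(p, r))


def screen_by_abc(rss_by_size, sigma2, top_frac):
    """Keep the model sizes whose ABC values fall in the smallest top_frac share.

    rss_by_size[r - 1] is the RSS of the best OLS model with r variables.
    """
    p = len(rss_by_size)
    ranked = sorted(range(1, p + 1), key=lambda r: abc(rss_by_size[r - 1], r, p, sigma2))
    keep = max(1, round(top_frac * p))  # assumed rounding of X% of p models
    return sorted(ranked[:keep])


# Hypothetical RSS values: large drops up to r = 7, then only small gains.
rss = [2000, 1500, 1100, 900, 750, 650, 600] + [600 - 3 * (r - 7) for r in range(8, 21)]
print(screen_by_abc(rss, sigma2=4.0, top_frac=0.10))
```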


Table 2 shows that AFTER outperforms all the other competitors, including SA. This is consistent with our understanding of a typical CFA scenario, under which AFTER is the proper choice of combining method. However, as we decrease X and select smaller sets of candidate forecasts for combining, the performance of SA gradually approaches that of AFTER. Such a result is not entirely surprising, considering that when only the top few models are selected, simply averaging them can perform similarly to the optimal results obtained by proper subset selection or combination methods (Wang et al., 2014). LinReg, which is not a proper choice under the CFA scenario, underperforms SA. As X decreases, LinReg becomes less subject to weight estimation error, and the performance of LinReg improves relative to SA.

From this example, we can see that the performance of SA is not robust to the degree of screening. Generally, it is a very challenging task to ensure the kind of screening that makes SA perform well. As a result, although SA works relatively well in this particular example under aggressive screening (keeping very few candidates), SA should not be preferred in general: without a good screening/selection rule, too much freedom is left for the analyst to make poor decisions. We note that a possible solution is to first create new candidate forecasts (e.g., forecasts generated by linear regression methods) that utilize most or all of the important information, and then let the multi-level AFTER approach (introduced in section 5), applied to both the original forecasts and the combined forecasts, play the role of a good screening/selection rule by reducing the influence of poor-performing or redundant forecasts.

Table 2: Comparing the normalized average forecast risk of different forecast combination methods after the screening procedure.

Top X%             10%     20%     40%     60%     80%
σ = 2, ρ = 0
  AFTER           0.998   0.989   0.966   0.951   0.945
  BG              1.000   0.999   0.997   0.997   0.996
  LinReg          1.017   1.024   1.056   1.098   1.151
σ = 2, ρ = 0.5
  AFTER           0.996   0.990   0.968   0.956   0.951
  BG              1.000   0.998   0.997   0.997   0.996
  LinReg          1.013   1.024   1.043   1.095   1.159
σ = 4, ρ = 0
  AFTER           0.994   0.987   0.984   0.981   0.974
  BG              0.999   0.998   0.998   0.998   0.997
  LinReg          1.002   1.012   1.056   1.101   1.163
σ = 4, ρ = 0.5
  AFTER           0.995   0.990   0.976   0.969   0.961
  BG              1.000   0.999   0.998   0.997   0.997
  LinReg          1.004   1.010   1.030   1.086   1.136
9. Real Data Example


In this section, we study the U.S. SPF (Survey of Professional Forecasters) dataset to evaluate SA and the mAFTER strategy. This dataset is a quarterly survey of macroeconomic forecasts in the United States. Lahiri et al. (2013) handled the missing forecasts by adopting two imputation strategies, known as regression imputation (REG-Imputed) and simple average imputation (SA-Imputed), to generate complete panels. As pointed out by Lahiri et al. (2013), the change of data administration agency in 1990 and the subsequently shifting missing data pattern make it difficult to use the entire data period for meaningful evaluation. Therefore, we inherit their missing forecast imputation as well as their forecast selection strategies, and focus on the period from 1968:Q4 to 1990:Q4 to evaluate the performance of the mAFTER strategy.

Three macroeconomic variables are considered: the seasonally adjusted annual rate of change of the GDP price deflator (PGDP), the growth rate of real GDP (RGDP) and the quarterly average of the monthly unemployment rate (UNEMP). The datasets for RGDP and PGDP have 14 candidate forecasts, and the datasets for UNEMP have 13 candidate forecasts. Each forecast provides $h$-quarter ($h = 1, 2, 3, 4$) ahead forecasting. We apply SA, AFTER, BG, LinReg and mAFTER to each SPF dataset of a macroeconomic variable with a given missing forecast imputation method. Each forecast combination method uses the first one fourth of the total time horizon to build up the initial weights, and the remaining time points are used to calculate the normalized MSFE of each method relative to SA. By taking the average over the four MSFEs that correspond to the 1, 2, 3, 4-quarter ahead forecasting, we summarize the performance of the different combining methods in Table 3.
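The evaluation metric can be sketched as follows. This assumes that, for each horizon, the realized values and each method's forecasts are aligned lists and that the first one fourth of the time points is excluded from the MSFE because it is used to build the initial weights; the function names are hypothetical.

```python
from statistics import fmean


def msfe(y, f, start):
    """Mean squared forecast error over time points start, start + 1, ..."""
    return fmean((yt - ft) ** 2 for yt, ft in zip(y[start:], f[start:]))


def normalized_msfe(y, f_method, f_sa, warmup_frac=0.25):
    """MSFE of a method relative to SA over the evaluation period."""
    start = int(len(y) * warmup_frac)  # first quarter only builds the initial weights
    return msfe(y, f_method, start) / msfe(y, f_sa, start)


def average_over_horizons(per_horizon):
    """Average normalized MSFEs over horizons (e.g., h = 1, 2, 3, 4).

    per_horizon is a list of (y, f_method, f_sa) triples, one per horizon.
    """
    return fmean(normalized_msfe(*triple) for triple in per_horizon)
```

Because each horizon is normalized by SA before averaging, a value below 1 means the method beats SA on average across horizons, which is how the entries of Table 3 are read.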


From Table 3, although AFTER performs quite differently across the target macroeconomic variables, the mAFTER strategy delivers robust overall performance for all three variables. For PGDP, AFTER performs the best and beats SA by as much as 10%, and mAFTER successfully maintains this advantage over SA. For RGDP, while SA and BG beat AFTER by up to 13%, mAFTER successfully pulls the performance to within 3% of SA. Finally, for the UNEMP variable, SA, BG and AFTER all perform very similarly, with no more than a 3% difference, and the performance of mAFTER does not deviate much from either SA or AFTER. The LinReg method, which aggressively pursues the optimal weight, performs poorly for all three target variables. It is interesting to note from Figure 6 that for both the PGDP and RGDP variables, the largest performance difference between SA and AFTER is found in the one-quarter ahead forecasting; in each case, mAFTER robustly matches the better of SA and AFTER.

Table 3: Comparing the performance of forecast combination methods with SPF datasets (values shown are normalized MSFEs averaged over 1, 2, 3, 4-quarter ahead forecasting).

Imputation     Target Variable   SA     LinReg   BG     AFTER   mAFTER
REG-imputed    PGDP              1.00   1.88     0.95   0.90    0.90
               RGDP              1.00   1.64     1.00   1.11    1.01
               UNEMP             1.00   1.79     0.99   0.98    0.98
SA-imputed     PGDP              1.00   2.17     0.98   0.95    0.95
               RGDP              1.00   1.83     1.00   1.13    1.03
               UNEMP             1.00   1.69     0.99   0.97    0.98

10. Conclusions 

Inspired by the seemingly mysterious FCP, we provide our explanations of why the 
puzzle often occurs and investigate when a sophisticated combining method can work well 
compared to the simple average (SA). Our study illustrates that the following reasons 
can contribute to the puzzle. 

First, estimation error is known to be an important source of the FCP. Both theoretical and empirical evidence show that a relatively small sample size may prevent some combining methods from reliably estimating the optimal weight. Second, the FCP can appear if we apply a combining method without considering the underlying data scenario; the relative performance of SA may depend heavily on which scenario is more appropriate for the data. Third, the weighting formula of a combining method is not always appropriate for the data, because structural breaks and shocks can happen unexpectedly. The weighting formula obtained by sophisticated methods cannot adjust fast enough to the new reality, resulting in performance less stable than that of SA. Fourth, candidate forecasts are often screened in some way so that the remaining forecasts used for combining tend to have similar performance, and SA may tend to work well in such cases. However, SA can be sensitive to the screening process, and enlarging the pool of candidates may benefit other combination methods; therefore, empirical observations that SA works well after model screening should be taken with a grain of salt. Fifth, there may be publication bias, in that people tend to report the existence of the FCP when SA gives good empirical results but may not emphasize the performance of SA when it gives mediocre results.

Figure 6: Comparing normalized MSFEs of different forecast combination methods with REG-Imputed SPF datasets. Left panel: PGDP variable. Right panel: RGDP variable. For each method, the bars from left to right represent 1, 2, 3, 4-quarter ahead forecasting results, respectively. The dashed line represents the SA baseline.
Regarding the first two reasons above, our study shows that it is not hard to find data
and build candidate forecasts in a certain way to favor a sophisticated or simple method. 
Under the CFA scenario, we realize that the heavy estimation price can be avoided by 
applying combining methods designed to target the performance of the best candidate 
forecast. Under the CFI scenario, although past literature has properly pointed out the 
potentially high cost of estimation error when targeting the optimal weight, it turns out 
that we do not have to pay the high cost. Indeed, a carefully designed mAFTER strategy 
can perform aggressively to target the optimal weight when information is sufficient to 
support exploiting the optimal weighting and perform conservatively like SA when the 
degree of estimation error is high. mAFTER can also intelligently perform according 
to the underlying scenario (CFA or CFI), avoiding the puzzle caused by improperly 
choosing the combining methods. 

SA certainly can be the best or among the top combining methods, as observed empirically and reported in the literature. It may be particularly useful when one can legitimately narrow the focus to just a few well-behaving candidate forecasts. However, since the uncertainty of the process used to reach the small set of candidates is not reflected in the showcase examples in the literature, the "conditional" results in favor of SA may not be replicable when one starts from scratch with inhomogeneous raw models/forecasts. For such problems, the performance of SA may span the whole spectrum, from terrible to the top of the chart. Also, when information is rich for a stable forecasting problem, SA may lose greatly to a model-based method (e.g., regression). In contrast, when the analyst has little confidence in the basic modeling assumptions on the data or in the quality of the available forecasts, SA (or a similar method) may well be the choice to take.

The puzzle repeatedly reported in the literature tends to create the sentiment that sophisticated methods are not trustworthy and simple methods should be used. Based on our understanding and the numerical results, it seems fair to say that if the sophisticated methods in those studies do not perform well, it is actually because they are not sophisticated enough, not the other way around! In particular, when SA is considered by mAFTER as a candidate, the possible advantage of SA is retained while its lack of robustness is avoided. To a large extent, the forecast combination puzzle no longer exists if we are able to move forward intelligently by integrating the strengths of different combining methods.


APPENDIX 


A. Assumptions of Proposition 1


The following two assumptions are sufficient regularity conditions for Proposition 1. Note that Assumption A.1 is satisfied if we truncate the candidate forecasts to have certain lower and upper bounds. Assumption A.2 is satisfied if the conditional distributions of the random noise are sub-Gaussian.


Assumption A.1. There exists a positive constant $M$ such that the candidate forecasts satisfy, with probability 1,

$$\sup_{1 \le i \le K,\ 1 \le t \le T} \big|m_t - \hat{y}_{t,i}\big| \le M.$$

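Assumption A.2. There exist a constant $r_0 > 0$ and continuous functions $0 < h_1(r), h_2(r) < \infty$ on $[-r_0, r_0]$ such that for every $1 \le t \le T$ and $r \in [-r_0, r_0]$,

$$E\big(|\varepsilon_t|^2 \exp(r|\varepsilon_t|) \,\big|\, x_t, z_{t-1}\big) \le h_1(r), \qquad E\big(\exp(r|\varepsilon_t|) \,\big|\, x_t, z_{t-1}\big) \le h_2(r)$$

with probability 1.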


B. Propositions and Proofs 


Proposition 2. Under the settings of Case 1, the average forecast risk of Forecaster 1 relative to the SA satisfies

$$\frac{R_{T,1}}{R_{T,\mathrm{SA}}} \to \frac{\sigma^2}{\sigma^2 + \cdots} \quad \text{as } T \to \infty.$$

In addition, if we consider weight vectors in $\mathbb{R}^2$, the asymptotic optimal combination weight $w^*$ satisfies

$$w^* := \arg\min_{w \in \mathbb{R}^2} \lim_{T \to \infty} R_{T,w} = (1, 0)^\top.$$
Proposition 3. Under the settings of Case 2, if we assume that $\beta = \beta_1 = \beta_2$ and $\sigma_X = \sigma_{X_1} = \sigma_{X_2}$, the average forecast risk of Forecast $i$ ($i = 1, 2$) relative to the SA satisfies

$$\frac{R_{T,i}}{R_{T,\mathrm{SA}}} \to \frac{\sigma_X^2\beta^2(1-\rho^2) + \sigma^2}{\sigma_X^2\beta^2(1-\rho^2)(1-\rho)/2 + \sigma^2} \quad \text{as } T \to \infty. \tag{A.1}$$

In addition, if we further assume $\rho = 0$, the asymptotic optimal combination weight $w^*$ under the restriction $\Theta = \{w : w_1 + w_2 = 1\}$ satisfies

$$w^* := \arg\min_{w \in \Theta} \lim_{T \to \infty} R_{T,w} = (1/2, 1/2)^\top, \tag{A.2}$$

and the asymptotic optimal combination weight $w^*$ without the restriction satisfies

$$w^* := \arg\min_{w \in \mathbb{R}^2} \lim_{T \to \infty} R_{T,w} = (1, 1)^\top. \tag{A.3}$$
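To see what the limit in (A.1) implies, the sketch below evaluates the limiting ratio $[\sigma_X^2\beta^2(1-\rho^2) + \sigma^2] / [\sigma_X^2\beta^2(1-\rho^2)(1-\rho)/2 + \sigma^2]$ for a few illustrative parameter values (the values and the function name are assumptions). Since the SA term carries the extra factor $(1-\rho)/2 < 1$ whenever $\rho > -1$, the ratio exceeds 1 for $\beta \neq 0$ and $|\rho| < 1$, so in this symmetric setting SA asymptotically beats either individual forecast.

```python
def asymptotic_ratio(beta, sigma_x, rho, sigma):
    """Limit of R_{T,i} / R_{T,SA} in the equal-coefficient setting of (A.1)."""
    signal = sigma_x ** 2 * beta ** 2 * (1 - rho ** 2)
    individual = signal + sigma ** 2                  # limiting risk of Forecast i
    simple_avg = signal * (1 - rho) / 2 + sigma ** 2  # limiting risk of SA
    return individual / simple_avg


for rho in (0.0, 0.5, 0.9):
    print(rho, round(asymptotic_ratio(beta=1.0, sigma_x=1.0, rho=rho, sigma=1.0), 4))
```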


The proof of Proposition 2 is similar to that of Proposition 3. In the following, we provide a sketch of the proof of Proposition 3.


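Proof of Proposition 3. Let $r_{T,1} = E(y_T - \hat{y}_{T,1})^2$, $r_{T,2} = E(y_T - \hat{y}_{T,2})^2$ and $r_{T,w} = E(y_T - \hat{y}_{T,w})^2$ be the point-wise forecast risks at time $T$ for forecaster 1, forecaster 2 and the combined forecast, respectively. We will first verify that, under the restriction $\Theta = \{w : w_1 + w_2 = 1\}$,

$$\begin{aligned}
r_{T+1,1} &= \sigma^2\Big(1 + \frac{1}{T}E\Big(\frac{\sigma_{X_1}^2}{\hat\sigma_{X_1}^2}\Big)\Big) + \beta_2^2\sigma_{X_2}^2\Big(1 - 2\rho\,E\Big(\hat\rho\,\frac{\sigma_{X_1}\hat\sigma_{X_2}}{\sigma_{X_2}\hat\sigma_{X_1}}\Big) + E\Big(\hat\rho^2\,\frac{\sigma_{X_1}^2\hat\sigma_{X_2}^2}{\sigma_{X_2}^2\hat\sigma_{X_1}^2}\Big)\Big),\\
r_{T+1,2} &= \sigma^2\Big(1 + \frac{1}{T}E\Big(\frac{\sigma_{X_2}^2}{\hat\sigma_{X_2}^2}\Big)\Big) + \beta_1^2\sigma_{X_1}^2\Big(1 - 2\rho\,E\Big(\hat\rho\,\frac{\sigma_{X_2}\hat\sigma_{X_1}}{\sigma_{X_1}\hat\sigma_{X_2}}\Big) + E\Big(\hat\rho^2\,\frac{\sigma_{X_2}^2\hat\sigma_{X_1}^2}{\sigma_{X_1}^2\hat\sigma_{X_2}^2}\Big)\Big),\\
r_{T+1,w} &= \sigma^2(1 - w_1^2 - w_2^2) + w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2}\\
&\quad + 2w_1w_2\Big(\rho\sigma_{X_1}\sigma_{X_2}\beta_1\beta_2\big(1 + E(\hat\rho^2)\big) - \sigma_{X_1}^2\beta_1\beta_2\,E\Big(\hat\rho\,\frac{\hat\sigma_{X_2}}{\hat\sigma_{X_1}}\Big) - \sigma_{X_2}^2\beta_1\beta_2\,E\Big(\hat\rho\,\frac{\hat\sigma_{X_1}}{\hat\sigma_{X_2}}\Big) + \frac{\rho\sigma_{X_1}\sigma_{X_2}\sigma^2}{T}E\Big(\frac{\hat\rho}{\hat\sigma_{X_1}\hat\sigma_{X_2}}\Big)\Big),
\end{aligned}$$

where $\hat\sigma_{X_i} = \sqrt{\sum_{t=1}^{T} x_{t,i}^2 / T}$ is the estimated covariate standard deviation ($i = 1, 2$) and $\hat\rho = \sum_{t=1}^{T} x_{t,1}x_{t,2} / (T\hat\sigma_{X_1}\hat\sigma_{X_2})$ is the estimated covariate correlation.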
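First, we have

$$\begin{aligned}
r_{T+1,1} &= E\big(y_{T+1} - x_{T+1,1}\hat\beta_{T+1,1}\big)^2\\
&= E\Big(\varepsilon_{T+1} + x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\frac{\sum_{t=1}^{T} x_{t,1}y_t}{\sum_{t=1}^{T} x_{t,1}^2}\Big)^2\\
&= \sigma^2 + E\Big(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\frac{\sum_{t=1}^{T} x_{t,1}(x_{t,1}\beta_1 + x_{t,2}\beta_2 + \varepsilon_t)}{\sum_{t=1}^{T} x_{t,1}^2}\Big)^2\\
&= \sigma^2 + E(x_{T+1,2}\beta_2)^2 + E(x_{T+1,1}\beta_2)^2\,E\Big(\frac{\sum_{t=1}^{T} x_{t,1}x_{t,2}}{\sum_{t=1}^{T} x_{t,1}^2}\Big)^2 + E\Big(\frac{x_{T+1,1}^2\big(\sum_{t=1}^{T} x_{t,1}\varepsilon_t\big)^2}{\big(\sum_{t=1}^{T} x_{t,1}^2\big)^2}\Big) - 2E\Big(\frac{x_{T+1,1}x_{T+1,2}\beta_2^2\sum_{t=1}^{T} x_{t,1}x_{t,2}}{\sum_{t=1}^{T} x_{t,1}^2}\Big)\\
&= \sigma^2 + \sigma_{X_2}^2\beta_2^2 + \sigma_{X_1}^2\beta_2^2\,E\Big(\hat\rho^2\,\frac{\hat\sigma_{X_2}^2}{\hat\sigma_{X_1}^2}\Big) + \frac{\sigma^2\sigma_{X_1}^2}{T}E\Big(\frac{1}{\hat\sigma_{X_1}^2}\Big) - 2\rho\sigma_{X_1}\sigma_{X_2}\beta_2^2\,E\Big(\hat\rho\,\frac{\hat\sigma_{X_2}}{\hat\sigma_{X_1}}\Big).
\end{aligned}$$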

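The expression for $r_{T+1,2}$ can be derived similarly. For $r_{T+1,w}$, we have

$$\begin{aligned}
r_{T+1,w} &= E\big(y_{T+1} - w_1\hat{y}_{T+1,1} - w_2\hat{y}_{T+1,2}\big)^2\\
&= \sigma^2 + E\big(w_1(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\hat\beta_{T+1,1}) + w_2(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,2}\hat\beta_{T+1,2})\big)^2\\
&= \sigma^2(1 - w_1^2 - w_2^2) + w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2}\\
&\quad + 2w_1w_2\,E\big((x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,1}\hat\beta_{T+1,1})(x_{T+1,1}\beta_1 + x_{T+1,2}\beta_2 - x_{T+1,2}\hat\beta_{T+1,2})\big)\\
&=: \sigma^2(1 - w_1^2 - w_2^2) + w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2} + 2w_1w_2A_1.
\end{aligned}$$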
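With tedious algebra, it is not hard to show that

$$A_1 = \rho\sigma_{X_1}\sigma_{X_2}\beta_1\beta_2\big(1 + E(\hat\rho^2)\big) - \sigma_{X_1}^2\beta_1\beta_2\,E\Big(\hat\rho\,\frac{\hat\sigma_{X_2}}{\hat\sigma_{X_1}}\Big) - \sigma_{X_2}^2\beta_1\beta_2\,E\Big(\hat\rho\,\frac{\hat\sigma_{X_1}}{\hat\sigma_{X_2}}\Big) + \frac{\rho\sigma_{X_1}\sigma_{X_2}\sigma^2}{T}E\Big(\frac{\hat\rho}{\hat\sigma_{X_1}\hat\sigma_{X_2}}\Big).$$

Together with the previous display, this verifies the formula for $r_{T+1,w}$. The formulas (A.1) and (A.2) can be verified straightforwardly by noting that the $x_t$'s are normally distributed and that $r_{T,i}/R_{T,i} \to 1$ as $T \to \infty$ ($i = 1, 2$). When there is no restriction on $w$, $r_{T+1,w}$ can be derived similarly as above. Then, we can show that when $w = (1, 1)^\top$, $\lim_{T \to \infty} R_{T,w} = \sigma^2$, which implies (A.3).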
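One step left implicit above is why the restricted risk has a simple minimizer. Under $w_1 + w_2 = 1$ we have $1 - w_1^2 - w_2^2 = 2w_1w_2$, so $r_{T+1,w} = w_1^2 r_{T+1,1} + w_2^2 r_{T+1,2} + 2w_1w_2(\sigma^2 + A_1)$, a quadratic in $w_1$ alone minimized at $w_1 = (r_{T+1,2} - c)/(r_{T+1,1} + r_{T+1,2} - 2c)$ with $c = \sigma^2 + A_1$. The sketch below, with hypothetical function names and illustrative inputs, computes this minimizer and checks it against a grid search; with $r_{T+1,1} = r_{T+1,2}$, as in the symmetric setting of (A.2), it returns equal weights.

```python
def restricted_risk(w1, r1, r2, sigma2, a1):
    """r_{T+1,w} from the proof, with w2 = 1 - w1."""
    w2 = 1.0 - w1
    return sigma2 * (1 - w1 ** 2 - w2 ** 2) + w1 ** 2 * r1 + w2 ** 2 * r2 + 2 * w1 * w2 * a1


def optimal_restricted_weight(r1, r2, sigma2, a1):
    """Minimizer (w1, w2) of restricted_risk subject to w1 + w2 = 1."""
    c = sigma2 + a1          # E[(y - yhat_1)(y - yhat_2)] in this setting
    denom = r1 + r2 - 2 * c  # equals E[(yhat_1 - yhat_2)^2] >= 0
    if denom <= 0:           # identical forecasts: every weight gives the same risk
        return 0.5, 0.5
    w1 = (r2 - c) / denom
    return w1, 1.0 - w1


print(optimal_restricted_weight(2.0, 5.0, 0.5, 0.5))
```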